# Long-video event capture
FastVLM 0.5B Stage3
Other
FastVLM-0.5B-Stage3 is an efficient multimodal language model that combines visual understanding with language processing. It can process long videos and generate structured outputs.
Image-to-Text
Transformers, English

zhaode
FastVLM 0.5B Stage2
Other
FastVLM-0.5B-Stage2 is an efficient multimodal language model capable of understanding visual content and handling text tasks.
Multimodal Fusion
Transformers, English

zhaode
Qwen2.5 VL 32B Instruct EXL2 4.25bpw
Apache-2.0
Qwen2.5-VL-32B-Instruct is the latest vision-language model in the Qwen family. It offers strong multimodal understanding and generation capabilities and supports interaction across images, videos, and text.
Text-to-Image
Transformers, English

christopherthompson81